GBE 



Reconstructing the Evolutionary History of Transposable 
Elements 

Arnaud Le Rouzic 1,*, Thibaut Payen 1,3, and Aurélie Hua-Van 1,2

1 Laboratoire Evolution, Genomes, Speciation, CNRS-LEGS-UPR9034, CNRS-IDEEV-FR3284, Gif sur Yvette, France 
2 Universite Paris-Sud 11, Faculte des Sciences, Orsay, France 

3 Present address: UMR INRA/UHP, Interactions Arbres/Micro-Organismes, INRA-Nancy, Champenoux, France 
*Corresponding author: E-mail: lerouzic@legs.cnrs-gif.fr.
Accepted: December 17, 2012 



Abstract 

The impact of transposable elements (TEs) on genome structure, plasticity, and evolution is still not well understood. The recent 
availability of complete genome sequences makes it possible to get new insights on the evolutionary dynamics of TEs from the 
phylogenetic analysis of their multiple copies in a wide range of species. However, this source of information is not always fully 
exploited. Here, we show how the history of transposition activity may be qualitatively and quantitatively reconstructed by considering 
the distribution of transposition events in the phylogenetic tree, along with the tree topology. Using statistical models developed to 
infer speciation and extinction rates in species phylogenies, we demonstrate that it is possible to estimate the past transposition rate of 
a TE family, as well as how this rate varies with time. This methodological framework may not only facilitate the interpretation of 
genomic data, but also serve as a basis to develop new theoretical and statistical models. 

Introduction

As transposable elements (TEs) have no systematic role in gen- 
omes beyond their own perpetuation, they are generally con- 
sidered as selfish DNA sequences (Doolittle and Sapienza 
1980; Orgel and Crick 1980). Nevertheless, their activity con- 
sisting in self-promoting mobility and duplication has notice- 
able consequences on host genomes, including mutation, 
recombination, change in genome size, and modification of 
the regulation patterns (Hua-Van et al. 2011). They are virtually
universal, and they have probably existed since the origin
of life; describing the dynamical properties of TEs thus appears 
as a necessary step toward a better understanding of genome 
evolution (Lynch 2007). 

The short- and long-term dynamics of TE families in their 
host genome has generated a significant amount of theoret- 
ical work in population and evolutionary genetics (Charles- 
worth B and Charlesworth D 1983; Charlesworth 1991; 
Charlesworth et al. 1994; Le Rouzic and Deceliere 2005). 
Population genetic models and simulations confirm that parasitic
TEs could realistically invade sexual populations and persist there
for a long time. Theoretical approaches have also suggested that
several long-term scenarios were possible, including the loss of all
copies, or the persistence of TE activity, either as a
transposition-selection equilibrium, or as a succession of burst and
decay stages (Charlesworth B and Charlesworth D 1983; Le Rouzic and
Capy 2006). Unfortunately, empirical insights remain scarce, and
information about TE dynamics in genomes, such as changes in the
transposition rate or correlations between different TE families,
does not cover enough species or enough TE families to support broad
and general
inference about genome evolution. The recent improvement 
in sequencing technology, as well as the availability of the 
corresponding data in public databases, makes it possible to 
anticipate significant progress on these issues. Yet, an import- 
ant factor limiting the exploration of genome evolution re- 
mains the availability of efficient statistical and analytical 
tools able to extract meaningful and synthetic information 
from such a large amount of data. 

As a consequence of their propensity to duplicate, TEs are 
present as multiple copies in genomes. The number of copies 
varies according to the TE family and the host species, from
very few insertion sequences in bacterial genomes (Chandler
and Mahillon 2002) to hundreds of thousands of LINE and 
SINE elements in human (Lander et al. 2001). For RNA- 
intermediate elements (class I), duplication is directly induced 
by the "copy-and-paste" transposition mechanism, whereas 



© The Author(s) 2012. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. 

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which
permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. 



Genome Biol. Evol. 5(1)77-86. doi:10.1093/gbe/evs130 Advance Access publication December 29, 2012 



for DNA "cut and paste" transposons (class II elements), du- 
plication arises indirectly via DNA replication and repair 
(Wicker et al. 2007). In any case, a transposition event may 
generate a duplicated copy, inserted into a new genomic site, 
with a sequence that is identical to the original element. From 
this point, copies accumulate mutations independently, and 
their divergence increases with time. 

Reconstructing the phylogeny of TE copies from the 
genome sequence of an individual could thus be used as a 
basis to infer the evolutionary history of a TE family in the 
whole species, and represents a rich source of information 
about genome evolution (Kazazian 2004; Ray et al. 2009; 
Biemont 2010). With this article, we intend to describe a 
simple and satisfactory methodological framework to infer 
TE evolutionary history in genomes, based on the birth- 
death models that have been developed to infer speciation 
and extinction rates in phylogenies (Yule 1924; Kendall 1948; 
Nee et al. 1994). We then discuss how to interpret the distri- 
bution of TE activity in the context of existing theoretical 
models. 

Materials and Methods 

Transposition Model 

Several evolutionary mechanisms are involved in the variation 
of the copy number in genomes. The number of elements 
increases by replicative transposition, which explains the 
maintenance of the genomic parasite. The transposition rate 
is not necessarily constant: it may be affected by various
regulation mechanisms, or by the progressive loss of transpos- 
ition activity by mutation accumulation on TE sequences. 
Meanwhile, copies can be lost by different processes, includ- 
ing transposition-related or spontaneous deletion. Natural 
selection may also affect TE copy number: if fitness decreases
as copies accumulate, individuals with fewer copies reproduce
more efficiently, which reduces the average copy number in
the next generation.

Formal population genetic models of TEs date back to the
early 1980s (Hickey 1982; Charlesworth B and Charlesworth
D 1983); see Charlesworth et al. (1994), Le Rouzic and Deceliere
(2005), and Lynch (2007) for reviews. Even though more elaborate
models (often not analytically tractable) have been
developed since then (Quesneville and Anxolabehere 1998;
Le Rouzic and Capy 2005; Dolgin and Charlesworth 2006;
Le Rouzic et al. 2007), we stick here to the simpler framework
described in Charlesworth B and Charlesworth D (1983),
which predicts the dynamics of the average number of copies per
genome ($n$) as:

$$n_{t+1} \approx n_t \cdot (1 + u_t - v), \qquad (1)$$

where $u_t$ is the replicative transposition rate at time $t$, and $v$ is
the deletion rate. In this setting, all parameters are considered
constant, except the transposition rate $u_t$, which can change
with time. For simplicity, the impact of natural selection, which
tends to decrease the probability of fixation of deleterious
copies, is here considered together with transposition regulation,
and is thus included in $u_t$. In the simulations, all copies are
able to transpose (which does not necessarily mean that they
are all capable of producing the transposition machinery).

To use this setting in a phylogenetic context, two assumptions
are necessary. First, in the original setting of Charlesworth
B and Charlesworth D (1983), time steps stood for
generations. At an evolutionary scale, the transposition
dynamics has to be approximated by a continuous process,
$u$ and $v$ becoming transposition and deletion rates per
time unit. Second, the phylogenetic inference is generally
drawn from a single sequenced genome, and the recent 
population process is ignored. The ancestral lineage of the 
sequenced individual is thus assumed to be representative of 
the whole species (i.e., recent transposition events could be 
different in another lineage, but their dynamics should be 
similar). 
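
The relationship between the discrete recursion of equation (1) and its continuous-time counterpart can be illustrated with a minimal Python sketch. The function names and the parameter values below are hypothetical and chosen for illustration only; they are not taken from the original analysis.

```python
import math

def expected_copies(n0, u_of_t, v, steps):
    """Iterate n_{t+1} = n_t * (1 + u_t - v) for `steps` discrete time steps."""
    n = float(n0)
    for t in range(steps):
        n *= 1.0 + u_of_t(t) - v
    return n

def continuous_copies(n0, u, v, time):
    """Continuous-time analogue with constant rates: n(t) = n0 * exp((u - v) * t)."""
    return n0 * math.exp((u - v) * time)

if __name__ == "__main__":
    # Arbitrary illustrative values: u = 0.10, v = 0.02, 30 time units.
    discrete = expected_copies(1, lambda t: 0.10, 0.02, 30)
    continuous = continuous_copies(1, 0.10, 0.02, 30)
    print(f"discrete: {discrete:.2f}, continuous: {continuous:.2f}")
```

The two predictions are close when the rates per time step are small, which is the situation in which treating transposition as a continuous process is a reasonable approximation.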

Birth-Death Models 

A birth-death model describes a stochastic branching process 
in which branches can split or disappear in the course of time. 
In traditional phylogenetic analyses, branch splitting events 
correspond to speciations, and dead branches correspond to 
species extinctions. Here, we propose to use the same frame- 
work, with a different interpretation: splitting branches are 
duplication (transposition) events (followed by the fixation of 
the duplicated copy), and extinct branches feature deletion 
events (followed by the fixation of the deleted allele). 

The simplest model involves only birth events with a constant
rate (using the notation presented in the previous section,
$u_t = u$ and $v = 0$), which describes a "pure birth" model
or Yule process (after Yule 1924). Branch extinctions ($v > 0$)
can be included in a more complex branching process as in
Kendall (1948), but application to statistical inference must
account for the fact that a splitting event can be noticed in
a phylogeny only if both lineages persist up to the present
time. According to Nee et al. (1994), the waiting time $t$ before
the next observable splitting event is described by the following
equation:

$$\mathrm{Prob}(t \mid u, v) = P_{\mathrm{split}} \times P_{\mathrm{obs}}, \qquad (2)$$

where $P_{\mathrm{split}}$ is the probability of a splitting event, which follows
an exponential distribution, and $P_{\mathrm{obs}}$ is the probability of
observing this splitting event from surviving branches. The
model is usually reparameterized with $r = u - v$, the net
diversification rate, and $a = v/u$, the extinction fraction
(Rabosky 2006). The expression of these probabilities, as
well as the corresponding likelihood function, can be found
in, for example, Nee et al. (1994). Maximizing this likelihood
function numerically yields estimates of $r$ and $a$ (and
thus of $u$ and $v$).
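
To make the estimation step more concrete, here is a minimal Python sketch of the two constant-rate models, working directly on the ages of the internal nodes of an ultrametric tree. It is not the code used in this study (which relied on R packages). The birth-death log-likelihood is written in the form usually attributed to Nee et al. (1994), with additive constants dropped, and should be checked against the original publication before any serious use. All function names and the node ages are hypothetical.

```python
import math

def yule_rate(node_ages):
    """ML birth rate of a pure-birth process from an ultrametric tree.

    node_ages: ages (time before present) of all internal nodes, root included.
    """
    ages = sorted(node_ages, reverse=True)
    n_tips = len(ages) + 1
    # Two lineages descend from the root; every later node adds one lineage,
    # so the total branch length is 2 * root age + sum of the other node ages.
    total_branch_length = 2.0 * ages[0] + sum(ages[1:])
    return (n_tips - 2) / total_branch_length

def birth_death_loglik(node_ages, r, a):
    """Log-likelihood of r = u - v and a = v / u (assumed form, constants dropped)."""
    ages = sorted(node_ages, reverse=True)
    n_tips = len(ages) + 1
    if r <= 0 or not 0 <= a < 1:
        return -math.inf
    return ((n_tips - 2) * math.log(r)
            + r * sum(ages[1:])
            + n_tips * math.log(1.0 - a)
            - 2.0 * sum(math.log(math.exp(r * x) - a) for x in ages))

def to_u_v(r, a):
    """Back-transform r = u - v and a = v / u into (u, v)."""
    u = r / (1.0 - a)
    return u, a * u

if __name__ == "__main__":
    ages = [0.30, 0.21, 0.15, 0.12, 0.07, 0.05, 0.02]  # made-up node ages
    # Crude grid search standing in for a proper numerical optimizer.
    grid = [(i / 100.0, j / 20.0) for i in range(1, 1001) for j in range(20)]
    r_hat, a_hat = max(grid, key=lambda p: birth_death_loglik(ages, *p))
    print(yule_rate(ages), r_hat, a_hat, to_u_v(r_hat, a_hat))
```

With $a = 0$, the birth-death log-likelihood above is maximized at the pure-birth estimate, which provides a simple consistency check between the two functions.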

Several extensions or alternatives to this model have been
developed to account for smooth or rapid changes in diversification
and/or extinction rates (Rabosky 2006; Stadler 2011).
Here, we explored four models, available as contributed packages
in R (version 2.14) (R Development Core Team 2011): the
"pure birth" model, implemented in the function yule() from
the package "ape" version 3.0-4 (Paradis et al. 2004), the
"birth-death" model from the function bd(), package "laser"
version 2.3 (Rabosky 2006), the exponential change in birth
rate ($u_t = u_0 e^{-kt}$, $k$ being the rate of the change) from a
modified version of function fitSPVAR() in "laser," and the
diversity-dependence model from function dd_ML() in package
"DDD" version 1.2 (Etienne and Haegeman 2012), in
which $u = u_0 - (u_0 - v) n / K$ ($K$ being the diversity-dependence
parameter). Changes in fitSPVAR() include 1) the possibility
to fit negative $k$ values (increase in diversification rate
with time) and 2) setting the extinction rate to 0. The corresponding
code and scripts are available on request. Support
intervals of parameters were estimated from 100 bootstrapped
trees (95% central values of the bootstrapped parameter
distribution).
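
The two time-varying rate functions and the percentile support interval described above can be sketched in a few lines of Python. This is only an illustration of the formulas given in the text, not a reimplementation of the "laser" or "DDD" fitting routines. The function names are hypothetical, and the index rounding in `support_interval` is one simple choice among several possible percentile conventions.

```python
import math

def exponential_rate(u0, k, t):
    """Exponentially changing transposition rate: u_t = u0 * exp(-k * t)."""
    return u0 * math.exp(-k * t)

def diversity_dependent_rate(u0, v, n, K):
    """Diversity-dependent transposition rate: u = u0 - (u0 - v) * n / K."""
    return u0 - (u0 - v) * n / K

def support_interval(estimates, level=0.95):
    """Central `level` interval of bootstrap estimates (simple percentile method)."""
    values = sorted(estimates)
    tail = (1.0 - level) / 2.0
    last = len(values) - 1
    lower = values[int(math.floor(tail * last))]
    upper = values[int(math.ceil((1.0 - tail) * last))]
    return lower, upper

if __name__ == "__main__":
    # Made-up bootstrap estimates of a rate, for illustration only.
    fake_bootstrap = [0.10 + 0.001 * i for i in range(100)]
    print(support_interval(fake_bootstrap))
    print(diversity_dependent_rate(u0=0.2, v=0.05, n=10, K=100))
```

In the diversity-dependent function, the rate equals $u_0$ when $n = 0$ and falls to $v$ when $n = K$, at which point transposition and deletion balance each other.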

Tree Imbalance 

Another meaningful piece of information that can be ex- 
tracted from TE phylogenetic analysis is related to the balance 
(or imbalance) of the trees. In a perfectly balanced tree, all 
branches duplicate once, while the most unbalanced tree cor- 
responds to the situation where all duplications happen in the 
same branch. In a TE-related context, balanced trees arise 
when all copies can duplicate at the same rate, while unba- 
lanced trees correspond to "master copy" models when only 
one copy in the genome is able to transpose. Being able to 
quantify the balance of TE phylogenetic trees may thus lead to 
meaningful insights on transposition history. 

The definition of mathematical and statistical tools to estimate
phylogenetic tree imbalance has generated a significant
amount of literature that cannot be explored here (see e.g.,
Kirkpatrick and Slatkin 1993; Aldous 2001; Blum and François
2006). We focused on a classical imbalance index, the β index.
Index estimation by maximum likelihood (ML) and statistical
analyses were performed with the package "apTreeshape"
version 1.4-5 (Bortolussi et al. 2006) for R.

Interestingly, there is no general definition of balanced
random trees. The literature reports two traditional models
of random trees, the "Proportional to Distinguishable
Arrangements" (PDA) model (assuming a uniform probability
for all tree shapes), and the "Equal Rate Markov" (ERM)
model, which corresponds to trees generated by a Yule process.
Trees generated under the ERM model have a β index of
0, whereas PDA trees are characterized by β = -1.5. The β
index can thus be interpreted along the following scale: imbalanced
trees (-2 < β < -1.5), random trees (-1.5 < β < 0),
and trees that are too perfectly balanced to be random
(0 < β < ∞).

Simulations 

Stochastic simulations were run to provide reference dynamics
for interpretation. Simulations consider a single genome
reproducing clonally (the "average genome" of the species),
and for simplicity, time steps are discrete. TE copies are followed
individually and their pedigree is stored by the simulation
program. The deletion rate $v$ per time step is constant,
and the transposition rate $u_t$ can vary with time arbitrarily. The
system evolves according to equation (1): at every time step,
$x_1 \sim \mathcal{P}(n_t \cdot u_t)$ new elements are created (all elements
having equal probabilities of being the master copy; $\mathcal{P}(x)$
stands for the Poisson distribution of mean $x$), and
$x_2 \sim \mathcal{B}(n_t, v)$ are randomly removed ($\mathcal{B}(N, p)$ stands for the
Binomial distribution). Distance matrices and phylogenetic
trees were reconstructed from the exact evolutionary relationships
between elements (no further stochasticity is introduced
to mimic the accumulation of mutations). Simulations were
run for 30 time steps with four sets of parameters: 1)
$u = 0.109$ and $v = 0$, 2) $u = 0.159$ and $v = 0.05$, 3)
$u_0 = 0.1 \to u_{30} = 0.219$ and $v = 0.05$, and 4)
$u_0 = 0.219 \to u_{30} = 0.1$ and $v = 0.05$ (the $\to$ symbol
representing a linear change with time). These parameters
were chosen so that the expected number of copies after
30 time steps should be 20. Simulations started with a
single copy, and 1,000 runs in which the final copy
number was between 15 and 25 were kept for each parameter
set.
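
The simulation scheme described above can be sketched in Python as follows. This is not the original simulation program. The names are hypothetical, the Poisson sampler uses Knuth's multiplication method because the Python standard library has none, and drawing the deleted copies among those present at the start of the time step is an assumption, since the text does not specify this detail.

```python
import math
import random

def poisson(rng, mean):
    """Draw from a Poisson distribution (Knuth's multiplication method)."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate(u_of_t, v, steps=30, seed=None):
    """Follow TE copies individually; return (pedigree, surviving copy ids).

    pedigree maps each copy id to (parent id, time step of birth).
    """
    rng = random.Random(seed)
    pedigree = {0: (None, 0)}          # the founder copy has no parent
    alive = [0]
    for t in range(steps):
        if not alive:
            break                      # the family went extinct
        n_t = len(alive)
        # x1 ~ Poisson(n_t * u_t) new copies, each from a random master copy.
        parents = [rng.choice(alive) for _ in range(poisson(rng, n_t * u_of_t(t)))]
        # x2 ~ Binomial(n_t, v) deletions among the copies present at step t.
        n_deleted = sum(rng.random() < v for _ in range(n_t))
        deleted = set(rng.sample(alive, n_deleted))
        alive = [copy for copy in alive if copy not in deleted]
        for parent in parents:
            new_id = len(pedigree)
            pedigree[new_id] = (parent, t + 1)
            alive.append(new_id)
    return pedigree, alive

if __name__ == "__main__":
    kept, seed = [], 0
    while len(kept) < 10:              # the text keeps 1,000 runs; 10 keeps the demo fast
        _, alive = simulate(lambda t: 0.159, 0.05, steps=30, seed=seed)
        if 15 <= len(alive) <= 25:     # same acceptance window as in the text
            kept.append(len(alive))
        seed += 1
    print(kept)
```

A linearly changing rate can be passed as, for example, `lambda t: 0.1 + (0.219 - 0.1) * t / 30`. The stored pedigree is sufficient to rebuild the exact tree of the surviving copies.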

The Fot Elements in Fusarium 

We used real genomic data from a recent work by Dufresne
et al. (2011) to illustrate this theoretical framework. Fot TEs are
Tc1-mariner-pogo elements found in filamentous fungi. Four
subfamilies extracted from the genome sequence of Fusarium
oxysporum were selected for their moderate number of independent
copies (a few dozen): Fot2 (28 copies), Fot3
(46 copies), Fot5 (145 copies), and Fot6 (38 copies). Duplicates
with homologous flanking regions, corresponding to
transposition-unrelated mechanisms (e.g., segmental duplication),
were removed from the data set (only one randomly chosen copy
was kept for each set of duplicates). Further details are
provided in Dufresne et al. (2011).

The phylogenetic analysis was performed in R (version 2.14)
(R Development Core Team 2011), using packages ape
(Paradis et al. 2004) version 3.0-4 and phangorn (Schliep
2011) version 1.6-3. An ML phylogeny was derived for each
Fot family, using a GTR + G (Gamma) model of substitutions.
Trees were rooted with elements from other families. Ultrametric
trees were calculated from the ML trees (without the
outgroup) using the "pathd8" method (Britton et al. 2007),
which happened to give visually more convincing results than
penalized likelihood (Sanderson 2002) or mean path length
(Britton et al. 2002), perhaps because of the unevenness of the
evolutionary rates across branches. Reproducing the analysis
with mean path length ultrametric trees provides very similar
results (not shown).

Results 

Interpretation of Phylogenetic Patterns 

In this article, we propose to quantify transposition activity 
over time from the distribution of transposition events. The 
steps required for such an analysis consist of 1) reconstructing
the phylogeny of TE sequences from a clean and exhaustive 
sequence data set of the TE family in the studied genome, 
from which duplicates (copies gained by other mechanisms 
than transposition, e.g., polyploidization or segmental dupli- 
cation) are removed, 2) estimating the age of the visible trans- 
position events, corresponding to the nodes in the tree, and 
3) inferring the past transposition dynamics from the branch- 
ing pattern. 

Simulation results illustrate how the divergence between 
homologous TE sequences reflects meaningful information 
about the transposition dynamics in this TE family. Transposition
is an exponential process: if the transposition rate per
copy is constant (fig. 1A), the number of new transpositions
increases with the copy number (fig. 1B). As a result, a constant
transposition rate mainly generates recent copies. One of
the most convenient visualization tools is the "lineage through
time" (LTT) plot, displaying the increase in the number of
branches in the tree with time (figs. 1C and 2). An exponential
increase of the number of lineages with time (linear trend on a 
logarithmic LTT plot) reflects a "pure birth" process with a 
constant transposition rate and no deletion. Departure from 
this linear pattern denotes deletions or changes in the trans- 
position rate and can be used as a basis for parameter 
estimation. 
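
As an illustration of how an LTT curve is built and read, the following Python sketch derives the LTT points from the ages of the internal nodes of an ultrametric tree and measures the slope of the logarithmic LTT by least squares. The function names are hypothetical. The slope is only a crude visual diagnostic of the "pure birth" expectation and is not the ML estimator used in this article.

```python
import math

def ltt_points(node_ages):
    """Return (time, number of lineages) pairs, one per branching event.

    node_ages: ages (time before present) of all internal nodes, root included.
    Time is expressed as minus the age, so that the present is at 0.
    """
    ages = sorted(node_ages, reverse=True)
    # The root creates 2 lineages; each later node adds one more.
    return [(-age, i + 2) for i, age in enumerate(ages)]

def log_linear_slope(points):
    """Least-squares slope of log(number of lineages) against time."""
    xs = [t for t, _ in points]
    ys = [math.log(n) for _, n in points]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

if __name__ == "__main__":
    ages = [0.30, 0.21, 0.15, 0.12, 0.07, 0.05, 0.02]  # made-up node ages
    points = ltt_points(ages)
    print(points)
    print(log_linear_slope(points))
```

A roughly straight logarithmic LTT is compatible with a constant transposition rate and no deletion, whereas curvature points to deletions or to a changing rate.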

Application to the Dynamics of Fot Elements in 
F. oxysporum 

Four subfamilies of Fot elements, numbered Fot2, Fot3, Fot5,
and Fot6, were retrieved from the genome of the filamentous
fungus F. oxysporum, as described in Dufresne et al. (2011).
All of these TE families are ancient, with elements displaying
genetic distances up to 35%. In all four subfamilies, recent
transposition events (identical or nearly identical sequences
inserted in nonhomologous positions) were detected, suggesting
that they are all still active. ML phylogenetic trees
suggest important changes in the molecular evolutionary
rates in some branches, most of them corresponding to
repeat-induced point mutations, a fungus-specific (but not
very active in F. oxysporum) defense mechanism against selfish
DNA (Cambareri et al. 1989; Galagan and Selker 2004). This
may lead to poor temporal estimates for some nodes, but
most copies remain unaffected, making further analysis on
ultrametric trees (fig. 3) meaningful.

Fig. 1. Single simulation of the temporal dynamics of a TE family
with a constant transposition rate (u = 0.1 per copy and per time step),
and no deletion ("pure birth" model). X axes are oriented from past to
present in reconstructed dynamics (A, B, C) (x = 0 corresponds to the start
of the transposition history; each bar stands for four successive generations).
With a constant transposition rate per copy (dashed line on A), the
number of copies increases exponentially. This increase is reflected by the
log-linear pattern of the LTT plot (C), which can be used as a basis for
reconstructing the dynamics of the TE family.

Branch lengths estimated by ML are corrected for multiple
mutations, and are thus expected to be proportional to the
evolutionary distance, assuming an approximate molecular
clock. As all sequenced elements are present in the genomes
of modern species, all the tips should be aligned when
the tree scales with time: the corresponding ultrametric trees
were obtained by the "pathd8" method, after removal of the
outgroups (see Materials and Methods). We first applied a
"pure birth" model (constant transposition rate and no deletion)
(table 1). The estimated transposition rates across the TE
families are quite similar, between 0.09 and 0.16 per percentage
of divergence. Nevertheless, the dynamics of these four
families are not identical, since the birth-death model
(allowing both transposition and deletion) could detect a non-null
deletion rate for Fot5, whereas no significant deletions
could be identified for the other families.

Fig. 2. Simulated LTT plots in four scenarios. Each line is the average
over 1,000 replicates. The pure birth model corresponds to a transposition-only
model; the birth-death model features both transpositions and deletions;
and the increasing and decreasing birth models represent linear
changes in the transposition rate (see Materials and Methods for details).
Different transposition dynamics generate different LTT profiles, illustrating
how the branching pattern from phylogenetic trees can be used to estimate
the transposition history.

The resulting LTT (or more exactly, lineage-through-divergence)
plots (fig. 4) suggest important departures from simple
models. The curves for all Fot families are above the "pure 
birth" prediction, which suggests that the past rate of dupli- 
cation per copy was higher than the current one. To check for 
changes in the transposition rate, we fit models in which 
transposition rates vary exponentially with time. Figure 5 illus- 
trates the resulting dynamics, as well as the 95% support 
intervals calculated from bootstrapped phylogenies. At least 
two TE families show clear changes in their transposition 
dynamics: in Fot2 and Fot6, the transposition rate tends to 
decrease with time. The slightly decreasing trends for Fot3 and 
Fot5 are not supported statistically. 

Finally, we exploited an existing model for diversity-dependent
speciation to test the hypothesis of transposition regulation.
Transposition regulation assumes that the transposition
rate decreases with the number of copies, which is necessary
to avoid an exponential invasion of TEs in genomes. The
model developed by Etienne et al. (2012) assumes that the
"ecosystem" (in our case, the genome) has a carrying capacity
K, so that the transposition rate varies with 1 - n/K, where n
is the number of TE copies of the family under consideration.
For all four TE families, the diversity-dependent model significantly
outperforms the birth-death model, with Akaike
Information Criterion (AIC) differences ranging from 15
units (Fot2) to 87 units (Fot5). However, estimated carrying
capacities (the number above which transposition would stop
completely) were well above the observed number of copies
(Fot2, Fot3, Fot5, and Fot6 occupy only 8%, 5%, 13%, and
4% of their theoretical niche, respectively). Although statistically
significant, diversity-dependence remains moderate, and
affects the transposition rate only marginally (the current
transposition rate for all families is more than 85% of the
estimated initial transposition rate when only one copy was
present in the genome). This result supports the idea that
transposition regulation by the number of copies is not
strong enough to allow for a stable transposition-deletion
equilibrium, although interpretation is obscured by the presence
in the genome of TE copies caught in segmental duplications,
which were not included in the phylogenetic analysis,
but which could be involved in regulation.
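
The link between niche occupancy and the weak reduction of the transposition rate can be checked by hand from the rate function given in Materials and Methods, $u = u_0 - (u_0 - v) n / K$. Dividing by $u_0$, and assuming $0 \le v \le u_0$ (an assumption introduced here only to obtain a bound), gives:

```latex
\frac{u}{u_0} \;=\; 1 - \left(1 - \frac{v}{u_0}\right)\frac{n}{K}
\;\ge\; 1 - \frac{n}{K},
\qquad 0 \le v \le u_0 .
```

With the largest reported occupancy, $n/K = 0.13$ (Fot5), this yields $u/u_0 \ge 0.87$, which is in line with current transposition rates being more than 85% of the initial rate in all families.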

Phylogenetic Tree Balance 

The β index for tree imbalance was computed as detailed in
the Materials and Methods section. ML estimates of β, as well
as 95% support intervals calculated from 500 bootstraps,
were as follows: β(Fot2) = -1.02 (-1.75, 4.04), β(Fot3) =
-1.01 (-1.61, 0.35), β(Fot5) = -1.03 (-1.30, -0.70),
and β(Fot6) = -1.16 (-1.78, -0.20). The estimates of tree
imbalance are thus very similar across the four TE families,
estimates being more precise in larger trees. All β estimates
are consistent with random trees. Tree imbalance is intermediate
between the two extreme models of random trees
(the Yule process or ERM model, β = 0, and the uniform
PDA model, β = -1.5). Fot5 and Fot6 trees exclude a Yule
process as a generating mechanism (β = 0 being outside of
the support interval), suggesting that the actual transposition
rate differs across clades. However, the "master copy"
hypothesis, which generates highly imbalanced trees
(β < -1.5), can be statistically rejected for most families.
Alternative indexes (Colless and Sackin indexes, as implemented
in the package "apTreeshape," Bortolussi et al. 2006)
provided identical results (tree imbalance intermediate
between ERM and PDA models, not shown).
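
For readers unfamiliar with the alternative indexes, the following Python sketch computes the unnormalized Colless and Sackin indexes on a binary tree encoded as nested 2-tuples. It follows the usual textbook definitions and is not the "apTreeshape" implementation, which may apply a normalization. The tree encoding and the function names are hypothetical.

```python
def tree_shape_indexes(tree):
    """Return (Colless index, Sackin index) of a binary tree.

    tree: nested 2-tuples; anything that is not a tuple is treated as a tip.
    Colless: sum over internal nodes of |tips on the left - tips on the right|.
    Sackin: sum over tips of the number of internal nodes between tip and root.
    """
    def walk(node):
        if not isinstance(node, tuple):
            return 1, 0, 0                      # (tips, colless, sackin)
        left, right = node
        n_left, c_left, s_left = walk(left)
        n_right, c_right, s_right = walk(right)
        n = n_left + n_right
        colless = c_left + c_right + abs(n_left - n_right)
        # Every tip below this node gains one ancestor, hence the "+ n".
        sackin = s_left + s_right + n
        return n, colless, sackin

    _, colless, sackin = walk(tree)
    return colless, sackin

if __name__ == "__main__":
    balanced = (("a", "b"), ("c", "d"))
    caterpillar = ((("a", "b"), "c"), "d")      # "master copy"-like shape
    print(tree_shape_indexes(balanced))
    print(tree_shape_indexes(caterpillar))
```

Both indexes increase with imbalance: the fully asymmetric "caterpillar" tree, which is the shape expected under a strict master copy model, scores higher than the balanced tree with the same number of tips.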

Discussion 

Transposition Dynamics 

With this article, our intention is to demonstrate how the
phylogenetic pattern of repeated genomic sequences could
be analyzed in terms of temporal dynamics. We showed that
different transposition dynamics lead to different distributions
of transposition events, and that it was possible to derive
models to reconstruct transposition history from available
sequence data, based on a quantitative statistical framework
used for species phylogenies.

Fig. 3. ML reconstructed phylogenies for the four Fot subfamilies. Trees were rooted with the other subfamilies. Ultrametric trees were obtained
through the "pathd8" algorithm (see "Materials and Methods"). Asterisks (*) denote nodes that are supported by bootstrap scores >50.

We believe that this strategy represents a significant 
improvement compared with the state of the art in genomics. 
The literature reports several ways to interpret phylogenetic 
and divergence data in similar contexts (Ray et al. 2008; Zerjal 
et al. 2009; Cordaux et al. 2010; Han et al. 2010; Dufresne 
et al. 2011). However, most of these methods are not devoid
of limitations, biases, or caveats. Frequently, the age of a TE 
family is calculated as the average distance between copies 
and a consensus sequence (supposedly close to the ancestral 
sequence). Yet, this procedure does not allow the exploration 
of within-family dynamics. This issue is sometimes overcome 
by assuming several successive transposition bursts (Pace and 
Feschotte 2007), which is restricted to TE families with many 
copies. Visual comparison of tree topologies is qualitative 
only, and information about absolute branch lengths is disre- 
garded. Alternatively, the distribution of pairwise distances 
between copies may provide quantitative results, but ancient
transposition events (deep and bushy nodes in the tree) are
counted several times, which severely hinders data interpreta- 
tion. These approaches are difficult to apply to other species or 
TE families with smaller copy number or different transposition 
activity, and are probably not suitable for systematic explora- 
tion of available data. An exception lies in the ingenious 
method proposed by SanMiguel et al. (1998), which consists 
in estimating the insertion date of retro-elements based on the 
similarity between their two long-terminal repeats (LTRs), 
strictly identical after transposition. Unfortunately, this strat- 
egy can be applied only to complete LTR retro-elements, and 
remains associated with large sampling errors due to the small 
size of LTR sequences. 

Model Limits 

The dynamics of TE sequences in genomes remain quite a 
complex process, and a simple model necessarily relies on 
approximations. In particular, quantifying the statistical error 
in phylogenetic analysis is known to be a complex issue
(Felsenstein 1988; Wrobel 2008; Kumar et al. 2012), because
errors are both quantitative (branch lengths) and qualitative
(tree topology, selection of the evolutionary model). Here, we
estimated errors using the same resampling strategy as for
phylogeny: confidence intervals of, for example, transposition
rates were derived from the distribution of estimated rates
obtained by running the model on a large number of bootstrapped
trees. This time-consuming resampling strategy has
the advantage of being applicable to any phylogenetic
reconstruction method.

Table 1
Estimates of the Diversification Rate r = u - v in the "Pure Birth" Model and in the "Birth-Death" Model (for Which the Extinction Fraction a = v/u Is Also Provided)

Family  Parameter  Pure Birth             Birth-Death
Fot2    r          0.155 (0.145, 0.168)   0.161 (0.144, 0.175)
        a          -                      0.000 (0.000, 0.000)
        u          0.155 (0.145, 0.168)   0.161 (0.144, 0.175)
        v          -                      0.000 (0.000, 0.000)
Fot3    r          0.118 (0.111, 0.124)   0.121 (0.091, 0.126)
        a          -                      0.000 (0.000, 0.004)
        u          0.118 (0.111, 0.124)   0.121 (0.112, 0.148)
        v          -                      0.000 (0.000, 0.051)
Fot5    r          0.118 (0.111, 0.125)   0.092 (0.067, 0.094)
        a          -                      0.004 (0.004, 0.006)
        u          0.118 (0.111, 0.125)   0.157 (0.155, 0.197)
        v          -                      0.065 (0.063, 0.126)
Fot6    r          0.122 (0.114, 0.130)   0.126 (0.109, 0.134)
        a          -                      0.000 (0.000, 0.000)
        u          0.122 (0.114, 0.130)   0.126 (0.109, 0.134)
        v          -                      0.000 (0.000, 0.000)

Note. 95% support intervals, calculated from 100 bootstrapped trees, are
indicated between parentheses. Estimates of u and v calculated from r and a
are also provided. r, u, and v are expressed in "events per percentage of
divergence," whereas a is unitless.

However, estimating the sampling noise associated with parameter
estimates does not inform about potential biases.
Estimates of transposition dynamics are reliable only if the
models on which they are based are good approximations
of the real processes, including sequence alignment, phylogenetic
reconstruction, tree dating, and transposition model.
A critical step here is the estimation of an ultrametric tree
(in which all tips are aligned and distances scale with time)
from an ML tree with different branch lengths. The evolutionary
rate of TE sequences is not very well understood, and is
known to vary dramatically between TE clades, due to, for
example, sequence inactivation (equivalent to pseudogenization),
or more specifically in our example, repeat-induced
point mutations, a fungus-specific regulation mechanism
(Cambareri et al. 1989; Galagan and Selker 2004). Tree topology
can also be affected by various biases; for instance, simulation
studies show that poor data tend to generate
imbalanced trees (see Mooers and Heard 1997 for review).
The estimated branching dynamics (branch length and topology)
thus rely on the robustness of a series of biological
assumptions; improving the phylogenetic reconstruction
(e.g., by implementing TE-specific features) may thus
significantly improve the reliability of the inferred transposition history.

Fig. 4. Lineage-through-divergence plots for the four Fot subfamilies. The dashed line illustrates the expectation for a "pure birth" model (constant
transposition, no deletions).



Genome Biol. Evol. 5(1)77-86. doi:10.1093/gbe/evs130 Advance Access publication December 29, 2012 






[Figure 5: x-axis, divergence (from -0.35 to 0.00); legend: Fot2, Fot3, Fot5, Fot6.]

Fig. 5. — Illustration of the estimated ML exponential dynamics (dots),
and the corresponding 95% support intervals from 100 bootstrapped
trees.



Although powerful and widely used in phylogenetics, branching
models should be interpreted carefully. One of the most
problematic issues is the lack of power to estimate the extinction
rate (v in our case) compared with the net diversification rate
(r = u - v), up to the point that some authors consider that
extinction rates should not be estimated at all from phylogenies
(Rabosky 2010). In our examples, a significant (but relatively
small) deletion rate could be detected for one out of four Fot
families. The estimated value of v is realistic, but alternative
interpretations could be proposed, such as a recent increase in
the transposition rate. More robust estimates of transposition
rates could be obtained from more extensive data, for example, by
comparing orthologous insertion sites between close species.
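
The difficulty can be illustrated with the expected copy number of a linear birth-death process, n0 exp((u - v) t), which depends on u and v only through r = u - v. The sketch below compares two hypothetical parameter sets sharing the same r; the numerical values and the function name `expected_copy_number` are assumptions chosen for illustration, and the calculation describes the full copy number, not the reconstructed tree on which the estimation is actually performed.

```python
import math


def expected_copy_number(n0, u, v, t):
    """Expected copy number at time t for a linear birth-death process
    with transposition rate u and deletion rate v per copy."""
    return n0 * math.exp((u - v) * t)


times = (0.0, 1.0, 2.0)
# Two hypothetical parameter sets with the same net rate r = u - v = 1.0.
no_deletion = [expected_copy_number(1, 1.0, 0.0, t) for t in times]
with_deletion = [expected_copy_number(1, 1.5, 0.5, t) for t in times]
```

Both parameter sets give the same expected trajectory, so the overall growth of the family mostly informs r; separating u from v requires additional signal.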

Interpreting variation of the transposition rate may also depend
on the detailed nature of TEs. Here, we present an example based
on cut-and-paste, class II TEs. In Fot elements, tree topologies
appear to be roughly balanced, and most copies are able to
transpose and to generate new branches in the tree, supporting (at
least partially) the exponential Yule model. This pattern appears
to be widespread for TE phylogenies (Cordaux et al. 2004).
However, other TEs (such as class I elements) are known to
generate a high proportion of "dead on arrival" copies after
transposition (i.e., most transposition events are asymmetric and
generate a nonfunctional copy), resulting in an extremely
imbalanced tree. Therefore, in the latter case, known as the
"master copy" model (Clough et al. 1996; Brookfield and Johnson
2006; Johnson and Brookfield 2006), an apparent slowdown in
branching should not necessarily be interpreted as a drop in
transposition activity as long as the transposition rate per
genome remains constant, even if the transposition rate per copy
mechanically decreases with time. Both tree topology and branching
dynamics, although almost independent statistically, thus provide
complementary information to reconstruct the evolutionary history
of repeated sequences.
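
The last point can be checked with simple arithmetic. In the sketch below, a constant transposition rate per genome U with no deletion gives an expected copy number n(t) = n0 + U t, so that the rate per copy, U / n(t), declines even though activity per genome is unchanged. The absence of deletions, the parameter values, and the function name `master_copy_rates` are assumptions made for illustration.

```python
def master_copy_rates(genome_rate, n0, times):
    """Expected copy number and per-copy rate under a master copy model.

    genome_rate: constant number of new copies per genome per unit time (U).
    n0: initial copy number, master copy included.
    Deletions are ignored (simplifying assumption of this sketch).
    Returns a list of (time, expected copy number, rate per copy).
    """
    rows = []
    for t in times:
        n = n0 + genome_rate * t  # linear, not exponential, growth
        rows.append((t, n, genome_rate / n))
    return rows


rows = master_copy_rates(2.0, 1, [0.0, 4.5])
```

With U = 2 and a single initial copy, the rate per copy falls from 2.0 at t = 0 to 0.2 at t = 4.5, whereas under the Yule model the rate per copy would stay constant and the rate per genome would grow with copy number.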

Perspectives 

A natural (yet nontrivial) extension of the model would account
for the activity of TE sequences. In general, genome scans reveal
at least three functional categories: active copies (canonical
elements), relic copies (equivalent to pseudogenes), and
nonautonomous copies (unable to code for the transposition
machinery, but mobile when trans-mobilized). Simulation models
have shown that the relative proportion of each kind of copy may
significantly affect the dynamics of the whole TE family (Le
Rouzic et al. 2007; Boutin et al. 2012). Ideally, such a
TE-specific evolutionary model should be taken into account in the
phylogenetic reconstruction, including, for example, different
mutation rates depending on the status of the copy, as well as the
location of pseudogenization events in the tree based on the
observed status of the sequences and the tree topology. Yet,
implementing such a model may require deep changes in the
phylogenetic algorithm.

Another issue concerns the most recent duplication events: the
branching model ignores recent population genetics mechanisms
(such as natural selection against slightly deleterious TE
copies), and the phylogeny reconstructed from a single individual
genome might provide a biased view of the recent transposition
history. There is little doubt that, along with progress in
sequencing, the genomes of several individuals per species will
soon be available, as is already the case for model species, which
is likely to help fix this issue (provided that a suitable
theoretical framework is developed).

In any case, the nature of the genomic data makes it pos- 
sible to obtain independent estimates of parameters of inter- 
est, which could validate phylogenetic models, or be used as 
fixed parameters to derive more complex models. For 
instance, deletion rates can be independently estimated by 
identifying and dating deletion events from TEs inserted in 
duplicated parts of the genome, which were not included in 
the phylogeny. The robustness of the procedure could also be 
improved by dating some of the tree nodes, by comparing 
insertions shared by close species, and inferring transposition 
timing based on estimates of speciation events from fossil data 
or phylogenies of conserved genes. 

Reconstructing the activity dynamics of TEs from genome sequences
thus requires combining tools from bioinformatics, phylogenetic
analysis, and population genetics. Here, we provide a
methodological framework to estimate and interpret the pattern of
transposition activity, using the statistical framework developed
to infer speciation and extinction dynamics in species
phylogenies. This framework can be extended, making it possible to
derive more efficient procedures and more realistic models. Given
the rapid accumulation of new genome sequences, the development of
a new set of tools devoted to the study of repeated sequences
appears to be one of the keys to improving the efficiency of the
analysis of such massive, costly, and informative data.